Introduction


The study of scientific reasoning ability (SRA) is one of the frequently
discussed topics in science education. For decades, researchers and educators
have devoted great effort to exploring how students develop their scientific
reasoning ability and how that ability affects their learning
achievements (Coletta, Phillips & Steinert, 2007; Ding, 2018; Johnson &
Lawson, 1998). In science education, the development of scientific reasoning
ability is an essential goal that has long been pursued and prioritized (Bao et
al., 2009; Engelmann et al., 2016). Thus, one of the primary tasks of science
education is to cultivate students into good reasoners who are scientifically
literate (Lawson, 2004). K-12 science educators have also made great efforts to
foster students’ scientific thinking and reasoning through their engagement
with familiar phenomena in daily life contexts (Kind & Osborne, 2017; van der
Graaf et al., 2019). To cultivate students’ reasoning ability in and for science
learning, many studies have investigated the nature of reasoning (Driver et
al., 1994) and its development via teaching practices (Lawson, 2004; Zimmerman,
2000; 2007). Building on these research efforts, a number of assessments
of scientific reasoning using standardized tests have been constructed and
implemented. These assessments, which mainly focus on evaluating
the level and complexity of reasoning involved in solving
science problems, can inform and improve educational practices in science
(Kind, 2013; Kalinowski & Willoughby, 2019; Lee & She, 2010). However,
valid and reliable assessments of scientific reasoning ability are still needed
to diagnose research gaps and identify areas for improvement in science
learning and teaching.

The recent discussion on scientific reasoning in science education uncovers
and underscores the significance of evidence use. Evidence is viewed as
the premise and basis of valid reasoning in the fields of logic and cognitive
psychology (Toulmin, 2003), and the collection and analysis of evidence are
necessary and vital for the formation of scientific reasoning and thinking
(Kanari & Millar, 2004). Scientific evidence and evidence-based reasoning
should be the core of students’ science learning experiences (Duschl, 2003).

Abstract. Scientific reasoning ability (SRA) is widely recognized as an essential goal for
science education. There is much discussion on the design and development of assessment
frameworks as viable tools to foster SRA. However, established assessments mostly focus on
the level of students’ reasoning attainment. Students’ ability to use evidence to support
reasoning is not adequately addressed and evaluated. In this study, a 6-level SRA assessment
framework was conceptualized and validated iteratively via synthesizing literature and a
Delphi study. Guided by the framework, an SRA assessment tool adopting and adapting PISA test
items and self-created items was developed and administered to 593 secondary students
(including 318 8th Graders and 275 9th Graders) in mainland China. Pearson correlation
analysis of students’ SRA assessment scores and their scores in scientific reasoning provided
criterion-related validation for the former (Pearson correlation = .527). Rasch analysis
further confirmed the validity and reliability of the SRA test and the assessment framework.
Combining quantitative and qualitative methods, the study provides a valid and reliable
analytical framework of SRA. It can inform the design of SRA assessments in various science
education contexts for diversified audiences.

The framework for K-12 science education adopted in the United States regards “engaging in argument from
evidence”, which is based on the reasoning ability to evaluate evidence about correlation and cause, as one of
the eight major practices in science and engineering. The ability and processes of collecting data based on
observation and formulating evidence are key to scientific inquiry and should be emphasized (National Research
Council, 2012).

However, in existing scientific reasoning assessments, how students use evidence to support their reasoning
processes is inadequately considered and evaluated (Osborne, 2013). Most of the reasoning assessments
were based on the use of the argument pattern (Toulmin, 2003) and tested with written assignments (Yanto et al.,
2019; Zhou et al., 2016), classroom discussion (Osborne et al., 2004), and student interviews (Adey & Csapo,
2012; Jimenez-Aleixandre et al., 2000). These assessments revealed students’ ability to make correct claims and
rational conclusions but did not reflect their ability to capture evidence (Sandoval & Millwood, 2005). In a series
of studies, an analytic framework of Evidence-Based Reasoning (EBR), which was also based on Toulmin’s argument
pattern, was established and applied to assess students’ ability to reason from evidence using writing tasks and
classroom discussion. According to this EBR framework, reasoning is the processing of two kinds of inputs
(i.e., data and premise) through three steps (data analysis, evidence interpretation, and rule application) to
form a claim as the final output (Brown et al., 2010a). Empirical data supported the validity of the framework and
affirmed the possibility of evaluating students’ SRA based on evidence (Brown et al., 2010b).

Motivated by both the inadequacy and the achievements of previous research efforts, the present study aimed
to conceptualize and validate an assessment framework, and to design and develop an assessment tool
of SRA that combines the complexity of reasoning and the use of evidence. Specifically, the SRA framework was
developed based on the analysis of existing assessment models and the use of a Delphi study which engaged
experts in science education. Building on the SRA framework, an assessment tool that incorporated test items
from PISA and self-developed items was compiled and implemented. Pearson correlation analysis of the SRA test
scores of 593 secondary students (including 318 8th Graders and 275 9th Graders) and their scores obtained in a
classic test of scientific reasoning (i.e., Lawson's Classroom Test) provided criterion-related validation for the SRA
assessment. The validity and reliability of the SRA assessment framework and test were further confirmed
by Rasch analysis results. The assessment developed in this study provides a valid analytic framework of SRA
that can contribute to the design of assessments and educational practices of SRA across a wide spectrum of
educational settings.


Theoretical Framework 
Scientific Reasoning: Grounded in Evidence 


From the perspective of science education, guided by the goal of cultivating students to be “better reasoners
in a general sense and become scientifically literates” (Lawson, 2004), scientific reasoning refers to the ability to
systematically investigate a problem concerning science, formulate and test hypotheses, control and
manipulate variables, evaluate experimental outcomes (such as data results), and make explanations (Bao et al.,
2009; Zimmerman, 2000; 2007).

Even though researchers held different opinions based on various perspectives, it was believed that scientific
reasoning is constrained by laws (Moshman, 1998), should deliberately consider the contextual correlations
among assorted information, and should consciously coordinate theory and evidence (Kuhn et al., 1995). From the
perspective of scientific discovery, Klahr and Dunbar (1988) proposed a dual search model that describes scientific
reasoning as a search in a hypothesis space and an experiment space. The model includes three components: searching
the hypothesis space, testing hypotheses, and evaluating evidence. This is similar to the phases of knowledge
acquisition described by Kuhn et al. (1995), which include evidence accumulation and evaluation to provide an
explanation and conclusion.

With strongly supported evidence and premises, students arrive at sub-claims if they are involved in a more
complicated reasoning process (van Eemeren et al., 2002). Each sub-claim should be connected with appropriate
evidence, be it new or additional. In this way, students accomplish reasoning and succeed in argumentation
(Belland et al., 2008). In other words, conducting scientific reasoning (SR) is an integrated process
whose complexity depends on the specific problems to be solved, and such ability enables an individual to persuade
the audience (i.e., teachers, peers, and community leaders) to agree with their claims or solutions to specific
problems (Hmelo-Silver, 2004).
Conceptualizing the Framework for Assessing SRA: Evidence Complexity and Reasoning Complexity


In science research, evidence is usually derived from observations and experiment results to support,
modify, refute, or form scientific hypotheses or theories (National Research Council, 2012), and should be accurate,
confirmed, and leading to necessary consequences that are “distinct from observations on which it
is based on and the principle it is intended to illustrate” (Brown et al., 2010a). In educational settings where
science-related problems are to be solved (e.g., the classroom, the laboratory, or outside of school), evidence is
used for supporting claims or decision-making. As reflected in OECD reports (2006; 2016), the contextual information,
or the evidence, provided can affect how students solve scientific problems, and thus can
be used to measure the levels of student SRA. The levels of students’ engagement and achievement in SRA are
influenced and reflected by the nature and complexity of the problem context as defined by 1) its familiarity
(i.e., familiar context vs unfamiliar context) (Choi & Hannafin, 1997), 2) the explicitness of the evidence (i.e.,
explicit evidence vs implicit evidence) (Salgado, 2016), and 3) the quantity of evidence (i.e., single evidence vs
multiple evidence) embedded in it. When there is only one piece of evidence (i.e., single evidence) that can be
directly captured rather than deeply explored (i.e., explicit evidence) from a familiar context that resembles daily
life experiences, the reasoning processes involved in problem solving are at the lowest level, as the evidence engaged
has the least complexity. When students deal with multiple, implicit evidence in an unfamiliar context, they are
involved in the most complex scientific reasoning processes (Dolan & Grady, 2010; Zhou et al., 2016).

In cognitive science, many studies have investigated the extent to which people are capable of rational
thought or of acting rationally in different circumstances (Kyllonen & Christal, 1990). As mentioned before, most of
the assessment frameworks of reasoning are based on Toulmin’s (2003) argument pattern. From the
perspective of science learning and teaching, the level of reasoning refers to students’ ability to systematically
investigate a problem concerning science, formulate and test hypotheses, control and manipulate variables, evaluate
experiment outcomes (e.g., data results), and make explanations (Bao et al., 2009; Zimmerman, 2000; 2007).

Grounded in theories of situated cognition, Dolan and Grady (2010) adopted a case study approach to construct
a matrix for evaluating the Complexity of Scientific Reasoning during Inquiry (CSRI) by analyzing teaching
practices. In this CSRI matrix, students’ cognitive processes of reasoning were categorized into four successive
levels based on complexity: least, somewhat, more, and most complex reasoning. Another significant analytical
framework of scientific reasoning, Lawson’s Classroom Test of Scientific Reasoning (LCTSR), has been widely
adopted and applied since its conceptualization in 1978 and its further improvement in 2000 (Bao et al., 2009;
Lee & She, 2010; Thompson et al., 2017). The LCTSR, a paper-and-pencil assessment composed of 12 paired, two-tier,
multiple-choice test items, investigates SRA along six dimensions: conservation reasoning,
proportional reasoning, control of variables, probabilistic reasoning, correlational reasoning, and
hypothetico-deductive reasoning.

From a logical perspective, two kinds of inference are distinguished based on the number of statements or
propositions involved. An immediate inference is drawn directly from one premise, without intervening or “mediating”
premises; a mediate inference is a logical inference drawn from more than one premise (Churchill, 1990). As the valid
form of inference is the primary concern of logic, in this study we pay more attention to the content
of premises (evidence) and the relationship between premises (evidence) and conclusion. The content of
premises represents the context and evidence sources, and the latter is the basic requirement of the reasoning
process. We therefore focused our exploratory efforts on the holistic and scientific process of reasoning based
on evidence to solve problems, and defined the SR process in terms of two aspects, direct reasoning and indirect
reasoning, which differ from the definitions and classification in logic. In direct reasoning, the relationship between
the pieces of evidence presented in the context of the science problem is quite simple and involves less complexity.
The evidence involved in direct reasoning can be either single or multiple. The reasoning processes are more
complicated if multiple pieces of evidence are coordinated. In indirect reasoning, by contrast, students manage
complicated relations among multiple pieces of evidence, be they covert or overt (Leron, 1985). Such processes demand
greater analytical and integrative skills. Table 1 presents the definitions of the three levels of reasoning based on
complexity. Levels 1 and 2 indicate the direct reasoning requirement; however, Level 2 is based on multiple pieces of
evidence. Level 3, an advanced level of reasoning, deals with multiple pieces of evidence with complicated connections
and demands students’ analytical and integrative skills.
Table 1
The Reasoning Complexity: level and definition

| Reasoning complexity level | Definition |
|---|---|
| Level 1: Direct reasoning - 1 | Students recognize and extract single evidence (S) from the context, reasoning directly based on evidence. |
| Level 2: Direct reasoning - 2 | Students recognize and extract multiple pieces of evidence (M) from the context, establish simple relations between/among the evidence. |
| Level 3: Indirect reasoning | Students recognize and extract multiple pieces of evidence (M) from the context, establish complicated relations between/among evidence. |


Evaluating SRA Combining the Complexity of Reasoning and Evidence


Integrating evidence complexity (EC) and reasoning complexity (RC), the framework for assessing and defining
student SRA in science learning, particularly in science-related problem solving, was initially constructed based
on four indicators: 1) context familiarity (familiar context vs unfamiliar context), 2) evidence explicitness
(explicit evidence vs implicit evidence), 3) evidence quantity (single evidence vs multiple evidence), and 4) reasoning
complexity (direct reasoning vs indirect reasoning), which collectively measure the complexity of scientific
reasoning (CSR). For details of the initial SRA assessment framework, please refer to Table 2. In Table 2, the EC in eight
kinds of combinations is matched to the different levels of RC. The CSR level has been formulated in a connective
way. As discussed above, an unfamiliar context increases the complexity of evidence, but not as much as
implicit evidence does. Thus, SEU and SEF refer to the same level of CSR (Level 1a). CSR is divided into nine
levels by the complexity levels of reasoning.


Table 2 
The Complexity of Scientific Reasoning (CSR) framework (initial version) 


| Reasoning complexity | Evidence complexity | Remarks | Level of CSR |
|---|---|---|---|
| Level 1 (Direct reasoning - 1) | SEF: Single-Explicit-Familiar | The lowest level of complexity | Level 1a |
| Level 1 (Direct reasoning - 1) | SEU: Single-Explicit-Unfamiliar | Unfamiliar evidence adds a little complexity | Level 1a |
| Level 1 (Direct reasoning - 1) | SIF: Single-Implicit-Familiar | Implicit evidence adds to the complexity (more than U) | Level 1b |
| Level 1 (Direct reasoning - 1) | SIU: Single-Implicit-Unfamiliar | I&U add to the complexity | Level 1b |
| Level 2 (Direct reasoning - 2) | MEF: Multiple-Explicit-Familiar | Establish simple relations | Level 2a |
| Level 2 (Direct reasoning - 2) | MEU: Multiple-Explicit-Unfamiliar | Establish simple relations; U adds a little complexity | Level 2a |
| Level 2 (Direct reasoning - 2) | MIF: Multiple-Implicit-Familiar | Establish simple relations; I adds to the complexity | Level 2b |
| Level 2 (Direct reasoning - 2) | MIU: Multiple-Implicit-Unfamiliar | Establish simple relations; I&U add to the complexity | Level 2b |
| Level 3 (Indirect reasoning) | MEF: Multiple-Explicit-Familiar | Establish complicated relations | Level 3a |
| Level 3 (Indirect reasoning) | MEU: Multiple-Explicit-Unfamiliar | Establish complicated relations; U adds to the complexity | Level 3a |
| Level 3 (Indirect reasoning) | MIF: Multiple-Implicit-Familiar | Establish complicated relations; I adds to the complexity | Level 3b |
| Level 3 (Indirect reasoning) | MIU: Multiple-Implicit-Unfamiliar | Establish complicated relations; I&U add to the complexity | Level 3b |
Research Methodology


A Delphi study was conducted to improve the content validity and face validity of the initial CSR framework 
proposed above (Osborn, 1963). Then an assessment tool using standardized test items was further developed to 
evaluate students’ SRA in solving science-related problems. 


The Delphi Method: The Modification of CSR Framework 


Following the Delphi method, the collective intelligence of experts was consulted to elaborate and
improve the initial CSR framework. The first stage involved a brainstorming session that gathered selected experts
“face to face” to provide opinions and comments (Isaksen, 1998). The expert group comprised researchers and
teachers specializing in science education. Altogether, one professor, four associate professors, two senior
lecturers, one middle school principal, three science teachers, seven doctoral candidates, and several master's
students were recruited as the experts (22 in total).

After the brainstorming session in which diversified opinions were mined and organized, another expert group
was formed for the Delphi survey via e-mail. This group comprised experts in the field of science education,
including four associate professors, one senior lecturer, six science teachers, and five doctoral candidates. Following
the Delphi principles, all experts were positioned “back to back” to ensure the opinions elicited were independent
rather than being influenced by each other (Rowe & Wright, 2001). The Delphi survey was conducted in three
rounds. The first was an open consultation during which the selected experts shared their suggestions for and
comments on the CSR framework. These responses were summarized and returned to the experts for the second
round of clarification and commenting. In the third round, the same process was repeated, and agreement
was reached among all the experts. The reaching of consensus among the experts marked the achievement of the
goal of the Delphi study (Bolger & Wright, 2011).

Based on the insights obtained in the Delphi study, the initial version of CSR framework was revised accordingly 
(Table 3). According to the experts, during the reasoning processes, implicit evidence would add more complexity 
than unfamiliar context would; reasoning involving familiar context would be less complicated than with unfamiliar 
context, but such difference in complexity was not obvious enough to distinguish them into different levels of 
CSR. Also, for different students, the degree of familiarity with the context of a science problem would be different 
due to their unique life experiences. The identification of the context of a science problem as familiar or unfamiliar 
would be difficult. Therefore, context familiarity would not be an appropriate indicator of CSR. According to these 
opinions, the revised framework categorized students’ SRA into six levels. 


Table 3 
The Complexity of Scientific Reasoning (CSR) framework (revised version) 


| Reasoning complexity | Evidence complexity | Remarks | Level of CSR |
|---|---|---|---|
| Level 1 (Direct reasoning-1) | SEF: Single-Explicit-Familiar | | Level 1a |
| Level 1 (Direct reasoning-1) | SEU: Single-Explicit-Unfamiliar | Unfamiliar evidence adds a little complexity | Level 1a |
| Level 1 (Direct reasoning-1) | SIF: Single-Implicit-Familiar | Implicit evidence adds to the complexity (more than U) | Level 1b |
| Level 1 (Direct reasoning-1) | SIU: Single-Implicit-Unfamiliar | I&U add to the complexity | Level 1b |
| Level 2 (Direct reasoning-2) | MEF: Multiple-Explicit-Familiar | Establish simple relations | Level 2a |
| Level 2 (Direct reasoning-2) | MEU: Multiple-Explicit-Unfamiliar | Establish simple relations; U adds a little complexity | Level 2a |
| Level 2 (Direct reasoning-2) | MIF: Multiple-Implicit-Familiar | Establish simple relations; I adds to the complexity | Level 2b |
| Level 2 (Direct reasoning-2) | MIU: Multiple-Implicit-Unfamiliar | Establish simple relations; I&U add to the complexity | Level 2b |
| Level 3 (Indirect reasoning) | MEF: Multiple-Explicit-Familiar | Establish complicated relations | Level 3a |
| Level 3 (Indirect reasoning) | MEU: Multiple-Explicit-Unfamiliar | Establish complicated relations; U adds to the complexity | Level 3a |
| Level 3 (Indirect reasoning) | MIF: Multiple-Implicit-Familiar | Establish complicated relations; I adds to the complexity | Level 3b |
| Level 3 (Indirect reasoning) | MIU: Multiple-Implicit-Unfamiliar | Establish complicated relations; I&U add to the complexity | Level 3b |
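
The revised framework in Table 3 can be read as a lookup from an evidence-complexity code and the type of reasoning to one of the six CSR levels. The following is a minimal sketch of that lookup in Python. The function name `csr_level` and its `indirect` flag are illustrative assumptions and are not part of the original study; the three-letter codes are those used in Table 3.

```python
def csr_level(evidence_code: str, indirect: bool = False) -> str:
    """Map an evidence-complexity code (e.g. 'SEF', 'MIU') to a CSR level.

    Illustrative helper, not taken from the original study. Following
    Table 3, context familiarity (F/U) does not change the level; only
    evidence quantity (S/M), explicitness (E/I), and the type of
    reasoning (direct vs indirect) do.
    """
    code = evidence_code.upper()
    valid = (
        len(code) == 3
        and code[0] in "SM"
        and code[1] in "EI"
        and code[2] in "FU"
    )
    if not valid:
        raise ValueError(f"unknown evidence code: {evidence_code!r}")

    quantity, explicitness = code[0], code[1]
    if quantity == "S":
        if indirect:
            # Table 3 pairs indirect reasoning only with multiple evidence.
            raise ValueError("indirect reasoning requires multiple evidence")
        number = 1
    else:
        number = 3 if indirect else 2

    letter = "a" if explicitness == "E" else "b"
    return f"Level {number}{letter}"
```

For example, `csr_level("SEU")` and `csr_level("SEF")` both give Level 1a, which mirrors the experts' view that familiarity alone should not separate levels.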
The Development of the SRA Assessment Tool


For evaluating students’ general abilities such as SRA, the test designed and developed should not involve
content knowledge as most established tests (e.g., LCTSR) did (Bao et al., 2009). As the test may be administered
to students in different grades, the knowledge they have acquired would be an extraneous variable and should
be controlled accordingly. Thus, in designing the SRA assessment tool, the context of a science problem should
not involve knowledge learnt at school but should provide ample information for students to analyze, capture and
transform evidence based on such information. In other words, the designed test items should involve abundant,
varied contextual information that is related to science and represented as different forms of “evidence”,
but should not contain specific science knowledge. This constitutes the most important principle that guided the
test design. Also, the test items should be developed according to the CSR framework, and more than one item
should be included for each complexity level in case some items prove inappropriate.

Besides self-developed items, the SRA test also incorporated PISA questions, which evaluate students’ general
science ability (OECD, 2006; 2015; Tamassia & Schleicher, 2002), highly emphasize the problem context (Bybee
et al., 2009; Fensham, 2009), and have been validated in several empirical studies (e.g., Dohn, 2007; Sadler &
Zeidler, 2009). In total, the SRA test consisted of 25 items, with 12 multiple-choice questions and 13
constructed-response questions. All the items were reviewed by experts in science education to ensure their content
validity. Table 4 provides a summary of the complexity level of scientific reasoning involved in each test item.


Table 4 
SRA test items and the corresponding complexity level based on the CSR framework 


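| Level of CSR | Evidence complexity | Items |
|---|---|---|
| Level 1a | SEF | P01, P04 |
| Level 1a | SEU | P08 |
| Level 1b | SIF | P09 |
| Level 1b | SIU | P07, P20, P12 |
| Level 2a | MEF | P13, P17 |
| Level 2a | MEU | P18 |
| Level 2b | MIF | P05, P21 |
| Level 2b | MIU | P02, P06, P11 |
| Level 3a | MEF | P10, P14, P19 |
| Level 3a | MEU | P22, P24 |
| Level 3b | MIF | P03, P25 |
| Level 3b | MIU | P15, P16, P23 |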


Procedures of Evaluation Study of SRA test 


A small-scale pilot study was carried out to explore inadequacies of the SRA test. Through convenience sampling,
31 students (16 6th Graders and 15 7th Graders; 12 boys and 19 girls) from different schools in
Shanghai, China participated in the pilot test, which was administered during an out-of-school activity by a science
teacher, assisted by one doctoral candidate and two master's students in science education who acted as observers.
After the test, a semi-structured interview was held to collect student feedback on the test. The data collected
(including student test scores, observations, and student feedback) were used to modify the instrument. The SRA
test was then validated with a much larger pool of students. The scores students obtained in the
SRA test were compared and correlated with their scores obtained in the LCTSR, a widely adopted and extensively
corroborated assessment for evaluating scientific reasoning (e.g., Bao et al., 2009; Lee & She, 2010). As SRA is a
construct closely related to scientific reasoning, and the LCTSR is a valid assessment, a good correlation between
students’ SRA test scores and their LCTSR scores would support the criterion-related validity of the SRA assessment.
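
The criterion-related check described above rests on the Pearson correlation between paired scores. A minimal sketch of that computation in plain Python follows; the function name `pearson_r` and the short score lists in the usage note are hypothetical illustrations and are not data from the study, which used SPSS.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)
```

With hypothetical paired scores such as `pearson_r([1, 2, 3, 4], [1, 3, 2, 4])`, the function returns 0.8. Only students who sat both tests can enter such a paired analysis.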

The participants of the LCTSR and SRA test were Grade 8 and 9 students from a junior middle school in 
Chizhou, China. Altogether, 582 students (309 8th Graders and 273 9th Graders; 306 boys and 276 girls) and 593 
students (318 8th Graders and 275 9th Graders; 306 boys and 287 girls) participated in the LCTSR and SRA test 
respectively. The two tests were administered one week apart. During each 30-minute test, each student
completed the printed test paper independently. Before the tests, all participants and their schools were invited to
sign a consent form. After consent was obtained from the participants, the relevant information about the test
was introduced.

Rasch analysis was employed to examine whether the test really targeted and evaluated the construct of
SRA and whether the CSR framework and the SRA test matched and corresponded to each other. Rasch modeling
is based on Item Response Theory (IRT), a “psychometric technique [that] was developed to improve the
precision with which researchers construct instruments, monitor instrument quality, and compute respondents’
performances” (Boone, 2016). In Rasch measurement, raw scores are converted into logarithmically scaled
interval-level measures. Estimates of person ability (i.e., student SRA in the present study) and item difficulty
(i.e., the difficulty level of SRA test items in the present study) can thus be placed together on a single
continuum. As measures of items and persons are sample and item independent, comparisons can be made
regardless of the sample chosen or items selected for assessment as long as they measure the same construct
(Bond & Fox, 2007). Due to its simple yet strong rationale, Rasch modeling has been extensively adopted in
psychometric research on the development and validation of measurement instruments (Wei et al., 2014). The
popularity of the Rasch model in science education research provides references for test development and analysis
on teacher classroom performance and student learning achievement (Liu, 2010; Randall, 2010; Wei et al., 2014).
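
Placing persons and items on one logit continuum means that the chance of success depends only on the difference between ability and difficulty. A minimal sketch of the dichotomous Rasch model is given below. The function name `rasch_probability` is an illustrative assumption, and items scored in more than two categories (as some constructed-response items may be) would need a polytomous extension that is not shown here.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are in logits. When ability equals difficulty the
    probability is exactly 0.5.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))
```

For instance, a person 0.35 logits above an item's difficulty, as in `rasch_probability(0.35, 0.0)`, has a success probability of roughly 0.59.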

According to the assumptions of Rasch modeling, multiple rounds of the test are required to measure data
fitness, and the collected empirical data should meet the specified criteria and structure for objective
measurement (Liu, 2010; Linacre, 2006; 2011). In this study, the holistic and iterative development process
implemented, including 1) the pilot test and test refinement, 2) the main study of the LCTSR and SRA test, and
3) Rasch analysis and further refinement (if any), could provide ample empirical evidence to validate and improve
the SRA assessment, as well as exemplify its application in assessing SRA in K-12 education settings.


Research Results 
The Results of Pilot Test 


Data analysis of student scores in the pilot test using SPSS 22.0 revealed satisfactory reliability of the SRA test
(Cronbach's α = .706). The test was appropriate for students in different grades, as their scores were spread out
(mostly ranging from 9 to 25; Mean = 16.41, SD = 4.492) and they could all finish the test within the designated time.
Student feedback indicated that the SRA test evaluated their general “thinking” or “intelligence”, not specific science
knowledge. Students also reported that some of the test items were difficult to understand, which negatively
affected their scores. The phrasing of these items was subsequently revised.
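
Cronbach's α, the reliability index reported above, compares the sum of the item variances with the variance of the total scores. A minimal sketch in Python follows, assuming a complete score matrix with one row per student and one column per item. The function name `cronbach_alpha` and any input data are hypothetical; the study itself obtained α from SPSS.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-student lists of item scores."""
    if len(scores) < 2:
        raise ValueError("need at least two students")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least two items and equal-length rows")
    # Sample variance of each item across students.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the students' total scores.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when all total scores are equal")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
```

Values closer to 1 indicate that the items vary together, which is the sense in which α is read as internal consistency.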


The Results of the LCTSR and SRA Test 


Altogether, the test scores of 552 students were included in the analysis. The correlation between their LCTSR scores
and SRA test scores was .527 (Pearson coefficient, p < .001, N = 552). This correlation indicated that, measured against
the LCTSR, the SRA test possessed good and statistically significant pragmatic validity and would help measure
students’ SRA. Additionally, the Cronbach's α of the SRA test was .809 (N = 593), indicating good reliability of the
test. This suggested that all 25 test items were measuring the same construct of SRA. The highest score obtained
was 30 (full score) and the lowest was 1 (Mean = 15.82, SD = 5.917). Although girls, in general, scored slightly lower
than boys (boys: Mean = 15.90, SD = 6.042, N = 306; girls: Mean = 15.74, SD = 5.790, N = 287), this gender difference was
not significant (t = .341, p = .733).
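
The gender comparison can be approximately re-derived from the reported summary statistics. The sketch below assumes an independent-samples t test with pooled variance, which the text does not state explicitly, and the function name `pooled_t` is illustrative. Because the reported means and SDs are rounded, the result is expected to be close to, but not identical with, the reported t = .341.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t statistic (pooled variance) from summary statistics."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error


# Boys vs girls, using the rounded values reported in the text.
t_gender = pooled_t(15.90, 6.042, 306, 15.74, 5.790, 287)
```

With these rounded inputs, `t_gender` comes out at roughly 0.33 with 591 degrees of freedom, in line with the non-significant difference reported.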
The Results of Rasch Analysis


In the following Rasch modeling, the overall analysis, item unidimensionality test, item fitness, and item
distribution were examined using WINSTEPS 3.72.0 software, with reference to the user manual (Linacre, 2006;
2011) and complementary studies (Liu, 2010; Sondergeld & Johnson, 2014) for index criteria. Table 5 presents a
summary of the person and item analysis results. The mean estimated ability of all participants is 0.35 (in logits),
a little higher than the mean item difficulty, which is generally set at 0. In Rasch modeling, the Error reflects the
accuracy of parameter estimation, and the closer the Error is to 0 (in logits), the better the test. Based on the
suggested criteria, the person ability and item difficulty errors are both acceptable, yet could be improved upon. The
person separation index and reliability coefficient are acceptable as well, and the item indices are good. The MNSQ
and ZSTD values are nearly perfect, as both INFIT and OUTFIT MNSQ were between .70 and 1.3. The results confirmed
that the SRA test was reliable and accountable for measuring students’ SRA.


Table 5 
Summary of person and item estimates and indices 


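| | Measure | Error | Infit MNSQ | Infit ZSTD | Outfit MNSQ | Outfit ZSTD | Separation | Reliability |
|---|---|---|---|---|---|---|---|---|
| Person | 0.35 | 0.47 | 1.00 | 0.0 | 1.03 | 0.0 | 1.92 | 0.79 |
| Item | 0.00 | 0.10 | 1.00 | 0.0 | 1.02 | 0.1 | 9.98 | 0.99 |

Note: All values are in logits.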
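
The separation and reliability columns in Table 5 are two views of the same quantity. In Rasch measurement, reliability R and separation G are related by R = G² / (1 + G²). The short sketch below applies this standard relation to the reported separation values; the function name `reliability_from_separation` is an illustrative assumption.

```python
def reliability_from_separation(separation: float) -> float:
    """Convert a Rasch separation index G into reliability R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)


person_reliability = reliability_from_separation(1.92)
item_reliability = reliability_from_separation(9.98)
```

Rounded to two decimals, these give 0.79 for persons and 0.99 for items, matching the reliability values in Table 5.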


Table 6 presents the fitness of all 25 SRA test items. Overall, these statistics fit the Rasch model, further
confirming the validity of the SRA test. The standard error (S.E.) for all items is below 1.0 (in logits), ranging from
.07 to .14. The Outfit and Infit indices of most items are acceptable, as no item had an MNSQ above 1.3; the
exception is P24, whose ZSTDs are both -9.9 and whose MNSQs are both below 0.5, suggesting the need for
revision or even elimination of the test item. Another important value, PT-MEASURE CORR (i.e., the point-measure
correlation, which relates the scores students obtained on a specific item to their overall measures), also helps
justify the test items. According to Liu (2010) and Linacre (2011), the more positive the correlation, the better the
test instrument design is. For the SRA test, all correlations are positive, ranging from 0.24 to 0.60, demonstrating
acceptable convergent validity of the SRA test.


Table 6 
Item fitness of the SRA test 


Item   Measure   Model S.E.   Infit MNSQ   Infit ZSTD   Outfit MNSQ   Outfit ZSTD   PT-MEASURE CORR.
P09    -0.58     .10          1.14         3.0          1.3           4.0           0.25
P11    1.6       .10          1.11         2.1          1.29          3.1           0.25
P12    -1.31     .08          1.22         3.4          1.09          1.2           0.52
P01    -1.1      mld          1.15         25           1.21          22            0.24
P04    -1.93     .14          0.97         -.3          1.21          1.3           0.33
P22    0.54      .07          1            3            1.2           3.3           0.46
P10    1.07      .09          1.11         2.8          1.19          2.8           0.29
P06    0.33      .09          1.01         4            114           27            0.38
P15    1.62      40           0.94         13           1.13          14            0.39
P23    0.19      43           1M           a            1.09          12            0.34
Pig    0.16      .09          1.04         12           1.07          13            0.37
pig    0.25      .09          1            4            1.05          1.0           0.4
P21    0.32      .07          1.04         8            1.01          4             0.59
P14    0.56      .09          1.02         E            1.04          8             0.39
P02    0.48      .09          1.02         5            0.99          4             0.4
P03    0.27      .09          1.01         4            1.02          3             0.39
P16    191       .1           1.01         2            1             4             0.33
P07    0.18      .09          0.98         6            0.92          44            0.44
P08    427       .1           0.95         8            0.96          3             0.4
P13    0.38      .07          0.96         8            0.92          AA            0.6
P05    0.21      .09          0.9          3.0          0.87          28            0.5
P17    0.99      .1           0.9          AT           0.83          1.9           0.48
P20    0.03      40           0.89         32           0.85          29            0.51
P25    1.81      44           0.84         18           0.63          26            0.47
P24    0.74      .07          0.43         -9.9         0.49          -9.9          0.54


Note: Measure and S.E. values are in logits. 
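
To make the Infit and Outfit MNSQ statistics in Table 6 more concrete, the following Python sketch shows how they are conventionally computed for a single item. It assumes a dichotomous (0/1) Rasch model, which may not match the scoring actually used for the SRA test; the function names and the response data are hypothetical and serve only as illustration. The 0.7 to 1.3 acceptance range follows the criterion mentioned in the text.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_fit(responses, abilities, difficulty):
    """Return (infit MNSQ, outfit MNSQ) for one dichotomously scored item.

    responses  -- list of 0/1 scores, one per person
    abilities  -- list of person measures in logits, in the same order
    difficulty -- item measure in logits
    """
    squared_residuals = []
    variances = []
    for score, ability in zip(responses, abilities):
        expected = rasch_probability(ability, difficulty)
        variances.append(expected * (1.0 - expected))
        squared_residuals.append((score - expected) ** 2)

    # Outfit: unweighted mean of the squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Infit: squared residuals weighted by the model variance (information).
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


def is_acceptable(mnsq: float, low: float = 0.7, high: float = 1.3) -> bool:
    """Check an MNSQ value against the acceptance range used in the text."""
    return low <= mnsq <= high


# Hypothetical data: six persons answering one item of difficulty 0.2 logits.
abilities = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
responses = [0, 0, 1, 0, 1, 1]
infit, outfit = item_fit(responses, abilities, 0.2)
print(f"infit MNSQ = {infit:.2f}, outfit MNSQ = {outfit:.2f}")
print("acceptable:", is_acceptable(infit) and is_acceptable(outfit))
```

Under this convention, an MNSQ near 1.0 indicates that the responses vary about as much as the model predicts, while a value well below 1.0, as reported for P24, indicates responses that are more predictable than the model expects (overfit).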


In the principal components analysis (PCA) of the Rasch residuals, the first eigenvalue is 1.8 (< 2.0), indicating that the SRA 
items can be treated as unidimensional (Linacre, 2011). This result showed that the SRA test measured only the construct 
of SRA. Figure 1 presents the item loading scatterplot derived from the PCA. The scatterplot shows the contrasts by 
plotting the loading on each component against the item calibration (Linacre, 2006). For a test instrument, if the 
contrast loading of every test item falls within the range of -.4 to +.4, the unidimensionality requirement is satisfied. 
As Figure 1 shows, only three items, A (P12), B (P13), and C (P21), had contrast loadings outside this range. 
Overall, the SRA test met the unidimensionality requirement, providing further evidence of its construct validity (Liu, 2010). 

Figure 1 
Contrast loadings of residuals (standardized residual contrast plot) in principal components analysis 
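
The two unidimensionality checks described above (a first-contrast eigenvalue below 2.0 and contrast loadings within -.4 to +.4) can be expressed as a short Python sketch. The function names are hypothetical, and the loading values are invented for illustration only: the study reports the scatterplot, not the numeric loadings.

```python
def eigenvalue_supports_unidimensionality(first_eigenvalue: float,
                                          threshold: float = 2.0) -> bool:
    """Check the first-contrast eigenvalue against the 2.0 criterion."""
    return first_eigenvalue < threshold


def items_outside_range(loadings: dict, limit: float = 0.4) -> list:
    """Return the items whose contrast loading lies outside [-limit, +limit]."""
    return [item for item, loading in loadings.items() if abs(loading) > limit]


# The eigenvalue 1.8 is the value reported in the text.
print(eigenvalue_supports_unidimensionality(1.8))  # True

# Hypothetical loadings, NOT taken from the study.
example_loadings = {"P12": 0.55, "P13": 0.48, "P21": 0.45, "P06": 0.10, "P15": -0.22}
flagged = items_outside_range(example_loadings)
print(flagged)  # ['P12', 'P13', 'P21']
```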

Figure 2 is the combined person and item estimate map, called the person-item map or Wright map. It displays 
the distribution of persons and items on the same interval-level logit scale. The left side of the map shows how 
persons (students) are distributed according to their ability (SRA) (“#” represents four persons and “.” one person). The 
persons at the top are those with high SRA, and those at the bottom are those with low SRA. On the right side of the map, 
the items are scattered according to their difficulty levels. Items at the top are at a high level of difficulty, and items 
at the bottom are at a low level of difficulty. In addition, the CSR framework is added in the right section for easier 
reference. As illustrated, the Wright map of the SRA test demonstrates a good distribution of both students and test 
items. Students were approximately normally distributed based on their SRA. The majority of the test items were 
at the typical difficulty level (indicated by M on the right), which matched the typical ability level of the students 
(indicated by M on the left). The good match between the item concentration pattern and the student ability 
concentration pattern indicated “optimized measurement precision in this instrument’s construction” (Juttner et 
al., 2013), confirming good construct validity of the developed SRA test (Aryadoust, 2009). 
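
Because persons and items share one logit scale on the Wright map, the distance between a person and an item translates directly into a success probability. The following Python sketch illustrates this under the assumption of a dichotomous Rasch model (the scoring model of the SRA test may differ); the function name is hypothetical. It uses the mean person ability (0.35) and mean item difficulty (0.00) from Table 5.

```python
import math


def probability_correct(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch model: P(correct) for a person-item pair in logits."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))


# A student of average ability on an item of average difficulty (Table 5).
p_average = probability_correct(0.35, 0.00)
print(f"{p_average:.2f}")  # 0.59

# A student located one logit above an item on the map.
p_one_logit = probability_correct(1.0, 0.0)
print(f"{p_one_logit:.2f}")  # 0.73
```

A student positioned level with an item on the map has a 50% chance of answering it correctly, so the closeness of the two M markers indicates that the test was targeted near the typical ability of the sample.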

Also, as noted, the item difficulty level, in general, corresponded to the CSR framework. This further validated 
the SRA test constructed. There were some items (P23, P25 in particular) whose actual difficulty level deviated 
from the one prescribed (as shown in Table 4), calling for further modification and improvement of the SRA test. 

Figure 2 
Wright map (person-item map) of the SRA test (N=593, 25 items) 

According to the Rasch modeling results elaborated above, the SRA test proved to be reliable and valid, 
and the 6-level CSR framework was appropriate and applicable for distinguishing students’ SRA. 


Discussion 


The present research has developed an integrated analytical framework and assessment tool for students’ SRA 
that emphasizes the use of evidence in reasoning. The development drew on a literature review, expert validation through a 
Delphi study, and empirical validation using multiple statistical methods (i.e., the correlation between the developed SRA 
test and LCTSR, and Rasch modeling). Data analysis results affirmed the reliability and validity of the SRA framework and test, 
which focus particularly on capturing evidence from contextual information in solving science-related problems. Science 
teachers and other science educators can readily adapt and apply the SRA assessment to analyze and diagnose 
student performance in educational practices that focus on SRA in and for science learning. 

On the other hand, as reflected by the data, some items of the SRA test (P23 and P25 in particular) need to 
be reconsidered. According to the Rasch analysis, the observed difficulty level of P23 did not match the level 
prescribed by the CSR framework. Even though P23 was designed to be a question of low CSR, only a few students 
answered it correctly. The interview data gathered in the pilot study suggested that the inappropriate 
wording of the contextual information (that would be transformed into evidence) prevented the students from 
providing the correct response. This observation pointed to the inadequacy of relying on expert validation alone in test design. 
To improve the readability and comprehensibility of test items, the test design should also take into account the opinions of 
students, who are the actual test-takers. According to the post-test interview and discussion data, P24 and P25 also suffer 
from ambiguous wording. To further improve the soundness of the SRA test, these items need to 
be revised or eliminated. 

Evaluating and diagnosing student performance with a legitimate assessment is the very first step toward improving 
their SRA for science learning. With the assessment results, teachers should reflect upon their teaching practices in an 
analytical, sustained, and critical way to enhance student performance (McNeill & Krajcik, 2011). Besides quantitative, 
objective evaluation, teachers may also further recognize and discern students’ SRA via qualitative methods such 
as analyzing student interviews and writing assignments (Keys, 1994), or showcasing and eliciting scientific reasoning in 
the science classroom (Furtak et al., 2010). Engaging students in reasoning processes as scientists do, by providing 
evidence-based guidance, would foster their understanding of science, enhance their reasoning skills, and help 
them reason rationally when encountering science-related problems in real life (Driver et al., 1994). 

In addition to applying the SRA test as a summative assessment to help identify areas for improvement and further 
action in teaching and learning, teachers are also advised to use the CSR framework as a formative assessment during 
classroom observation of students’ discourse as they interact with the teacher, peers, and the scientific phenomenon or 
problem, thereby enabling assessment for learning. Such real-time, contextualized evaluation and feedback enabled by the 
CSR framework can help empower “teaching by inquiry” (Gerber et al., 2001) and the scientific practices (e.g., reasoning 
based on evidence, communicating and arguing with peers using evidence, solving contextualized problems, and 
conducting experiments or surveys) highly encouraged by NGSS (2013) to enhance students’ scientific 
reasoning abilities (Gerber et al., 2001; Johnson & Lawson, 1998). 


Conclusions and Implications 


The SRA assessment developed in this research using both qualitative and quantitative methods, though 
shown to be valid and reliable, has some limitations. First, as a paper-and-pencil test, the 
SRA test did not create the authentic contexts in which scientists solve real-world problems. The processes of doing science, 
including raising questions, forming hypotheses, conducting experiments, obtaining data, and providing 
explanations, in which scientific reasoning plays a significant role, were hardly elicited. Furthermore, as the nature of 
scientific reasoning is closely related to the nature of science (NOS) (which is a key element in science curricula), the 
assessment and development of SRA in students should be deeply embedded in the processes of the functioning of 
science, the generation and testing of scientific knowledge, and the working of real scientists (McComas, 2011; Taber, 
2006). In the present study, though the test items were designed to simulate authentic practices of science as much 
as possible, an in-depth understanding of science and scientific reasoning was still lacking. To collect and interpret holistic, 
process-oriented data on SRA, qualitative approaches are needed to help triangulate data. Considering the richness 
of scientific reasoning in communicative interactions in the science classroom, qualitative methods such as discourse 
analysis and process evaluation could be used in combination with quantitative methods in future investigations. 

Moreover, the selection of subjects for the quantitative method relied on convenience sampling due to limited 
resources, which might limit the generalizability of the findings. In the future, random sampling should 
be used to produce more convincing results. In addition, data from student samples at different stages of schooling 
should be subjected to Rasch modeling to strengthen the test design further. Also, as the basis of the SRA test, the CSR 
framework has been inspected and validated by experts. As only a small group of experts was involved, their opinions 
might be biased or limited. Further iterations will be carried out to improve the assessment framework and the tool. 